Poodle: Pandas + Sklearn, Tutorial V02

Sung-Jin Kim, Apr 11, 2016

Pandas is an excellent framework for data management, and Sklearn is a powerful tool for machine learning. However, there has been no package that combines the two. Poodle is a package that combines Pandas and Sklearn to offer convenience and power at the same time in data science.

Previously, you had to read data from a file before performing machine learning, unless the data was generated online. In Poodle, you do not need a separate step to read your data: the machine learning tools in Poodle read data from a file when it is needed. Moreover, your data sheet files are always kept synchronized with your machine learning operations, so you can keep monitoring your data while machine learning runs.

This is a basic tutorial illustrating how to use Poodle, although the tutorial is still under development. By reading this document, you can see what Poodle does. As a data scientist, I always feel tired when I perform machine learning, mainly for two reasons. First, as mentioned above, there must be steps to load data into memory before machine learning can start. Second, once data is loaded into memory, it is hard to monitor how the data is processed during machine learning, so you often end up saving your data at each step by hand, which makes the work tiresome. Poodle aims to remove the loading step and to make the saving fully automatic.

Below, the basic usage of Poodle is introduced with an example: linear regression on dummy data. To run the example yourself, you need two files: the Poodle code in poodle/linear_model.py and the dummy data in sheet/xy_pdl.csv.

Note that although the input data file is a CSV file, the actual format is an extension of CSV. You must use ID, X, and y as special keywords, where X and y represent the feature array and the target array as in Sklearn, and ID marks the index column, which becomes the index of the DataFrame(). The feature names below X, such as x1, x2, x3, are not fixed, so you can use any words for them, which gives great flexibility. The same holds for y1, y2 below y: you can use 'left', 'right' instead of 'y1', 'y2', depending on your project. Similarly, the index names below ID are flexible. The example uses sorted numeric values, but you can use any words, in any order.
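The layout described above can be sketched with plain Pandas. This is only an illustration under assumptions: the exact on-disk layout of a Poodle sheet is not shown in this tutorial, so the sketch assumes two header rows (the keyword row, then the free-form names) as written by DataFrame.to_csv(), and the names 'left', 'right', 'a', 'b', 'c' are made up for the example.

```python
import io

import pandas as pd

# Two-level column header: the top level holds the keywords X and y,
# the second level holds free-form feature and target names.
columns = pd.MultiIndex.from_tuples(
    [("X", "x1"), ("X", "x2"), ("X", "x3"), ("y", "left"), ("y", "right")]
)
df = pd.DataFrame(
    [[1, 2, 3, 6, 20], [4, 5, 6, 15, 47], [7, 8, 9, 24, 74]],
    columns=columns,
    # Index labels under ID do not have to be numeric or sorted.
    index=pd.Index(["a", "b", "c"], name="ID"),
)

# What a sheet with this layout could look like as CSV text (assumed layout).
csv_text = df.to_csv()
print(csv_text)

# Read it back, telling Pandas about the two header rows and the index column.
loaded = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)
X = loaded["X"]  # feature block with columns x1, x2, x3
y = loaded["y"]  # target block with columns left, right
```

Selecting loaded["X"] and loaded["y"] gives the two blocks that Sklearn would receive as the feature array and the target array.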

The following packages are imported only for the purposes of this tutorial.

They are not needed for Poodle itself unless you use them for other purposes.


In [2]:
from importlib import reload
import sklearn.linear_model
import pandas as pd
import numpy as np

Start from a simple linear regression

Let us start with linear regression, a simple but widely used machine learning method.

Learning and Prediction

In Poodle, linear_model can be imported just as in Sklearn. In Sklearn, input and output data are variables, while Poodle supports CSV files based on the Pandas DataFrame().


In [3]:
from poodle import linear_model
reload( linear_model)


Out[3]:
<module 'poodle.linear_model' from '/home/jamessungjinkim/Dropbox/Aspuru-Guzik/python_lab/py3/jamespy_py3/poodle/linear_model.py'>

As mentioned before, the Sklearn commands for LinearRegression can be used, except that the input data are no longer arrays. Instead, they are stored in a CSV file, so you give a file name instead of the arrays X and y.

  • fit() is modified for a special purpose: loading input data from a file.

In [4]:
ml = linear_model.LinearRegression()
ml.fit('sheet/xy_pdl.csv')


Out[4]:
LinearRegression()
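For comparison, here is a minimal sketch of the two explicit steps that the single fit() call above replaces, written with plain Pandas and Sklearn. The CSV text is a stand-in for 'sheet/xy_pdl.csv' with the values shown later in Out[6]; its exact layout (two header rows followed by a row naming the index) is an assumption.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for 'sheet/xy_pdl.csv'. The on-disk layout is an assumption:
# two header rows (keyword, then name) followed by a row naming the index.
SHEET = """,X,X,X,y,y
,0,1,2,0,1
ID,,,,,
0,1,2,3,6,20
1,4,5,6,15,47
2,7,8,9,24,74
3,4,5,8,17,55
4,8,9,4,21,59
"""

# Step 1: load the sheet into memory (the step Poodle aims to hide).
df = pd.read_csv(io.StringIO(SHEET), header=[0, 1], index_col=0)
X = df["X"].to_numpy()
y = df["y"].to_numpy()

# Step 2: fit on arrays, as plain Sklearn expects.
ml = LinearRegression()
ml.fit(X, y)

# In this dummy data the targets are exact linear functions of the features
# (y0 = x0 + x1 + x2 and y1 = 2*x0 + 3*x1 + 4*x2), so the fit reproduces them.
print(np.round(ml.predict(X), 6))
```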

From here on, every other operation is the same as in the original LinearRegression. You can predict for new input data. For now, the additional input data is not a file; this will be updated to use a file later on. After that, you will be able to specify training data in fit() and test data in predict().

  • predict() is the same function as in Sklearn; it is inherited from the parent class. Below, predict() from Sklearn is used to show what is possible.

In [5]:
ml.predict( 'sheet/x_pdl.csv', 'sheet/yp_pdl.csv')
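The call above passes two file names to predict(). One way such file-based fit() and predict() methods could be layered on top of Sklearn is sketched below. FileLinearRegression and read_sheet are hypothetical names introduced for this illustration, and the sheet layout is assumed; this is not Poodle's actual implementation.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def read_sheet(path):
    """Read a sheet with two header rows and an index column (assumed layout)."""
    return pd.read_csv(path, header=[0, 1], index_col=0)


class FileLinearRegression(LinearRegression):
    """Hypothetical file-aware regressor; an illustration, not Poodle's code."""

    def fit(self, xy_path):
        df = read_sheet(xy_path)
        return super().fit(df["X"].to_numpy(), df["y"].to_numpy())

    def predict(self, x_path, yp_path):
        df = read_sheet(x_path)
        yp = super().predict(df["X"].to_numpy())
        yp_df = pd.DataFrame(yp, index=df.index)
        yp_df.columns = yp_df.columns.astype(str)
        # Store the inputs next to the predictions under a 'yp' keyword.
        pd.concat({"X": df["X"], "yp": yp_df}, axis=1).to_csv(yp_path)
        return yp


# Usage with temporary files standing in for the sheets of this tutorial.
X = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 5, 8], [8, 9, 4]])
y = pd.DataFrame([[6, 20], [15, 47], [24, 74], [17, 55], [21, 59]])
X.index.name = y.index.name = "ID"

with tempfile.TemporaryDirectory() as tmp:
    xy_path = os.path.join(tmp, "xy.csv")
    x_path = os.path.join(tmp, "x.csv")
    yp_path = os.path.join(tmp, "yp.csv")
    pd.concat({"X": X, "y": y}, axis=1).to_csv(xy_path)
    pd.concat({"X": X}, axis=1).to_csv(x_path)

    ml = FileLinearRegression().fit(xy_path)
    yp = ml.predict(x_path, yp_path)
    saved = read_sheet(yp_path)  # the predictions can now be inspected on disk
```

Because the predictions are written next to the inputs, the result file can be opened and monitored at any time, which is the motivation given at the start of this tutorial.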

Investigating Formats of Data Sheets

In Poodle, data sheets must follow a certain format. Otherwise, machine learning operations will not work.

  • To prepare your input data, you may refer to the example data sheet 'sheet/xy_pdl.csv'.

In [6]:
linear_model.read_csv( 'sheet/xy_pdl.csv')


Out[6]:
    X         y
    0  1  2   0   1
ID
0   1  2  3   6  20
1   4  5  6  15  47
2   7  8  9  24  74
3   4  5  8  17  55
4   8  9  4  21  59

In [7]:
linear_model.read_csv( 'sheet/yp_pdl.csv')


Out[7]:
    X          yp
    0  1  2     0     1
ID
0   1  2  3   6.0  20.0
1   4  5  6  15.0  47.0
2   7  8  9  24.0  74.0
3   4  5  8  17.0  55.0
4   8  9  4  21.0  59.0

Other functions of LinearRegression() and other tools in Sklearn will be added to Poodle step by step.


In [16]:
reload( linear_model)
gs = linear_model.GridSearchCV()

In [17]:
gs.fit( 'sheet/xy_pdl.csv', 'sheet/gs_pdl.csv')


Out[17]:
GridSearchCV(estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
       param_grid={})

In [18]:
linear_model.read_csv( 'sheet/gs_pdl.csv')


Out[18]:
    X         y        yp
    0  1  2   0   1     0     1
ID
0   1  2  3   6  20   6.0  20.0
1   4  5  6  15  47  15.0  47.0
2   7  8  9  24  74  24.0  74.0
3   4  5  8  17  55  17.0  55.0
4   8  9  4  21  59  21.0  59.0
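For reference, the configuration printed in Out[17] (a LinearRegression estimator with an empty param_grid) can be approximated in plain Sklearn as sketched below. The scoring and cv arguments are my own choices and are not taken from Poodle: with only five rows, five folds leave a single sample in each test fold, and the default R² score is not well defined on a single sample, so mean squared error is used instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Values copied from the data sheet shown in this tutorial.
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [4, 5, 8], [8, 9, 4]], dtype=float)
y = np.array([[6, 20], [15, 47], [24, 74], [17, 55], [21, 59]], dtype=float)

# An empty grid means a single candidate: the estimator's default settings.
gs = GridSearchCV(
    LinearRegression(),
    param_grid={},
    scoring="neg_mean_squared_error",  # assumption, see the note above
    cv=5,                              # assumption: one sample per test fold
)
gs.fit(X, y)

# After the search, the best estimator is refit on all rows and used here.
yp = gs.predict(X)
print(gs.best_params_)
print(np.round(yp, 6))
```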
